{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# scikit-learn中的Scaler"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 对测试数据集进行归一化时需要注意的事项"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 错误的测试数据集归一化的方法\n",
    "\n",
    "![错误的测试数据集归一化的方法](images/错误的测试数据集归一化的方法.png)\n",
    "\n",
    "### 正确的测试数据集归一化的方法\n",
    "\n",
    "![正确的测试数据集归一化的方法](images/正确的测试数据集归一化的方法.png)\n",
    "\n",
    "### 对训练数据集和测试数据集使用同一个均值和方差的原因\n",
    "\n",
    "![对训练数据集和测试数据集使用同一个均值和方差的原因](images/对训练数据集和测试数据集使用同一个均值和方差的原因.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sklearn中的归一化封装方法Scaler的使用\n",
    "\n",
    "### Scaler的使用流程\n",
    "![Scaler的使用流程](images/Scaler的使用流程.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn import datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris = datasets.load_iris()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = iris.data\n",
    "y = iris.target"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[5.1, 3.5, 1.4, 0.2],\n",
       "       [4.9, 3. , 1.4, 0.2],\n",
       "       [4.7, 3.2, 1.3, 0.2],\n",
       "       [4.6, 3.1, 1.5, 0.2],\n",
       "       [5. , 3.6, 1.4, 0.2],\n",
       "       [5.4, 3.9, 1.7, 0.4],\n",
       "       [4.6, 3.4, 1.4, 0.3],\n",
       "       [5. , 3.4, 1.5, 0.2],\n",
       "       [4.4, 2.9, 1.4, 0.2],\n",
       "       [4.9, 3.1, 1.5, 0.1]])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X[:10, :] # 前10行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### sklearn中均值方差归一化类StandardScaler的使用"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "standardScaler = StandardScaler()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "StandardScaler(copy=True, with_mean=True, with_std=True)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "standardScaler.fit(X_train)  # 训练数据集归一化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([5.84732143, 3.0375    , 3.79017857, 1.20892857])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "standardScaler.mean_ # 结果中每个都是X_train中对应列的均值，因为列即特征，所以即为每个特征的均值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "ename": "AttributeError",
     "evalue": "'StandardScaler' object has no attribute 'std_'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
      "\u001b[1;32m<ipython-input-22-4fe55cf19055>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mstandardScaler\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mstd_\u001b[0m \u001b[1;31m# 新版sklearn已经把std_去掉了\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[1;31mAttributeError\u001b[0m: 'StandardScaler' object has no attribute 'std_'"
     ]
    }
   ],
   "source": [
    "standardScaler.std_ # 新版sklearn已经把std_去掉了"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.81536143, 0.43138585, 1.75310744, 0.75113647])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "standardScaler.scale_ # std建议用scale替换，这里都是方差的意思"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ -6.49053738,  -8.85486221,  -1.80087024,  -0.56184959],\n",
       "       [ -9.19806055, -11.00431645,  -2.97221688,  -3.22045304],\n",
       "       [ -5.7384476 ,  -6.16804441,  -1.47549617,   0.32435155],\n",
       "       [ -7.24262715,  -8.85486221,  -1.73579543,  -0.38460937],\n",
       "       [ -5.88886556,  -5.63068085,  -1.54057099,  -0.03012891],\n",
       "       [ -8.29555283,  -9.92958933,  -2.41908097,  -1.8025312 ],\n",
       "       [ -8.29555283,  -2.94386305,  -2.77699244,  -3.04321281],\n",
       "       [ -7.69388101, -11.00431645,  -2.0937069 ,  -1.44805074],\n",
       "       [ -8.29555283,  -5.09331729,  -2.90714207,  -3.39769327],\n",
       "       [ -6.49053738,  -9.92958933,  -1.80087024,  -1.09357028]])"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "standardScaler.transform(X_train)[:10] # 对整个矩阵进行归一化，不会改变X_train。取前10行看看"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = standardScaler.transform(X_train) # 对整个矩阵进行归一化，不会改变X_train,所以需要赋值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test_standard = standardScaler.transform(X_test) # 需要用同一个归一化类实例来对测试机进行归一化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "knn_clf = KNeighborsClassifier(n_neighbors=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
       "                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,\n",
       "                     weights='uniform')"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "knn_clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9736842105263158"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "knn_clf.score(X_test_standard, y_test) # 获取计算地准确率。训练数据集和测试数据集必须都进行归一化(transform)操作"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.34210526315789475"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "knn_clf.score(X_test, y_test) # 直接使用X_test数据可以看到预测准确率很低，这里主要就是因为X_test没有进行归一化"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## sklearn中的最值方差归一化的类叫MinMaxScaler\n",
    "\n",
    "> 所在包是`sklearn.preprocessing.MinMaxScaler`，不是很常用，一般我们还是用`sklearn.preprocessing.StandardScaler`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
